Estimation and Optimization of Composite Outcomes

There is tremendous interest in precision medicine as a means to improve patient outcomes by tailoring treatment to individual characteristics. An individualized treatment rule formalizes precision medicine as a map from patient information to a recommended treatment. A treatment rule is defined to be optimal if it maximizes the mean of a scalar outcome in a population of interest, e.g., symptom reduction. However, clinical and intervention scientists often seek to balance multiple and possibly competing outcomes, e.g., symptom reduction and the risk of an adverse event. One approach to precision medicine in this setting is to elicit a composite outcome which balances all competing outcomes; unfortunately, eliciting a composite outcome directly from patients is difficult without a high-quality instrument, and an expert-derived composite outcome may not account for heterogeneity in patient preferences. We propose a new paradigm for the study of precision medicine using observational data that relies solely on the assumption that clinicians are approximately (i.e., imperfectly) making decisions to maximize individual patient utility. Estimated composite outcomes are subsequently used to construct an estimator of an individualized treatment rule which maximizes the mean of patient-specific composite outcomes. The estimated composite outcomes and estimated optimal individualized treatment rule provide new insights into patient preference heterogeneity, clinician behavior, and the value of precision medicine in a given domain. We derive inference procedures for the proposed estimators under mild conditions and demonstrate their finite sample performance through a suite of simulation experiments and an illustrative application to data from a study of bipolar depression.


Introduction
Precision medicine is an approach to healthcare that involves tailoring treatment based on individual patient characteristics (Hamburg and Collins, 2010;Collins and Varmus, 2015). Accounting for heterogeneity by tailoring treatment has the potential to improve patient outcomes in many therapeutic areas. An individualized treatment rule formalizes precision medicine as a map from the space of patient covariates into the space of allowable treatments (Murphy, 2003;Robins, 2004). Almost all methods for estimating individualized treatment rules have been designed to optimize a scalar outcome (exceptions will be discussed shortly). However, in practice, clinical decision making often requires balancing trade-offs between multiple outcomes. For example, clinicians treating patients with bipolar disorder must manage both depression and mania. Antidepressants may help correct depressive episodes but may also induce manic episodes (Sachs et al., 2007;Ghaemi, 2008;Goldberg, 2008;Wu et al., 2015). We propose a novel framework for using observational data to estimate a composite outcome and the corresponding optimal individualized treatment rule.
The estimation of optimal individualized treatment rules has been studied extensively, leading to a wide range of estimators designed to suit an array of data structures and data-generating processes (Kosorok and Laber, 2019; Tsiatis et al., 2020). These estimators include: regression-based methods like Q-learning (Murphy, 2005;Qian and Murphy, 2011;Schulte et al., 2014;Laber et al., 2014a), A-learning (Murphy, 2003;Robins, 2004;Blatt et al., 2004;Moodie et al., 2007;Wallace and Moodie, 2015), and regret regression (Henderson et al., 2010); direct-search methods (Rubin and van der Laan, 2012;Zhang et al., 2012b;Zhao et al., 2012;Zhang et al., 2013;Zhou et al., 2017) based on inverse probability weighting (Robins, 1999;Murphy et al., 2001;van der Laan and Petersen, 2007;Robins et al., 2008); and hybrid methods (Taylor et al., 2015;Zhang et al., 2018). The preceding methods require specification of a single scalar outcome that will be used to define an optimal regime; were individual patient utilities known, then they could be used as the outcome in any of these methods. However, in general such utilities are not known though they can be elicited provided a high-quality instrument is available (Butler et al., 2018); in the absence of such an instrument, preference elicitation is difficult to apply. A method for constructing a composite utility that is best predicted using a non-parametric machine learning model is proposed by (Benkeser et al., 2020); howeer, they do not consider heterogeneous utilities or the construction of precision medicine strategies. 1 We propose a new paradigm for estimating optimal individualized treatment rules from observational data without eliciting patient preferences. The key premise is that clinicians are attempting to act optimally with respect to each patient's utility and thus the observed treatment decisions contain information about individual patient utilities. This idea is similar to that introduced by Wallace et al. (2018) (see also Wallace et al., 2016); however, we provide an estimator for the probability that a patient is treated optimally, rather than assuming that all patients are treated optimally. We construct estimators of individual patient utilities which do not require that clinicians are acting optimally, only that they approximately follow an optimal policy. This approach allows us to describe the goals of the decision maker and how these goals vary across patients, determine what makes a patient more or less likely to be treated optimally under standard care, and estimate the decision rule which optimizes patient-specific composite outcomes. We develop this approach in the context of a single-stage, binary decision in the presence of two outcomes. An extension to the setting with more than two outcomes is discussed in the Appendix.
Other methods for estimating optimal treatment rules in presence of multiple outcomes include using an expert-derived composite outcome for all patients (Thall et al., 2002(Thall et al., , 2007Murray et al., 2016;Moser et al., 2020). However, this does not account for differences in the utility function across patients and in some cases it may not be possible to elicit a high-quality composite outcome from an expert. Alternatively, multiple outcomes can be incorporated using set-valued treatment regimes (Laber et al., 2014b;Lizotte and Laber, 2016;Wu, 2016), constrained optimization (Linn et al., 2015;Laber et al., 2018;Wang et al., 2018), or inverse preference elicitation (Lizotte et al., 2012). Schnell et al. (2017) extend methods for estimating the benefiting subgroup to the case of multiple outcomes using the concept of admissibility (see also Schnell et al., 2016). However, none of these approaches provide a method for estimating an individual patient's utility.
This work is closely related to inverse reinforcement learning (Kalman, 1964;Ng et al., 2000;Abbeel and Ng, 2004;Ratliff et al., 2006), which involves studying decisions made by an expert and constructing the utility function that is optimized by the expert's decisions. Inverse reinforcement learning has been successfully applied in navigation (Ziebart et al., 2008) and human locomotion (Mombaur et al., 2009). Inverse reinforcement learning methods assume that decisions are made in a single environment. However, in the context of precision medicine, both the utility function and the probability of optimal treatment may vary across patients. Our approach is a version of inverse reinforcement learning with multiple environments.
This work is also related to the notion of stated and revealed preferences in the health economics literature. 2 Viewed through this lens, our work might be characterized as using clinical decisions as a kind of surrogate for patient revealed preferences thereby avoiding the need for the elicitation of stated preferences using specialized instruments. This is advantageous as the construction of high-quality instruments is difficult and collection of preference information is not routine in many areas (Carlsson and Martinsson, 2003;Ryan et al., 2007;de Bekker-Grob et al., 2012;Soekhai et al., 2019); though see Butler et al. (2018) for an illustrative application when such an instrument is available. Challenges associated with preference elicitation for precision medicine are discussed in Laber et al. (2014b), Lizotte andLaber (2016).
In Section 2, we introduce a pseudo-likelihood method to estimate patient utility functions from observational data. In Section 3, we state a number of theoretical results pertaining to the proposed method, including consistency and inference for the maximum pseudolikelihood estimators. Section 4 presents a series of simulation experiments used to evaluate the finite sample performance of the proposed methods. Section 5 presents an illustrative application using data from the STEP-BD bipolar disorder study. Conclusions and a discussion of future research are given in Section 6. Proofs are given in the appendix along with additional simulation results and a discussion of an extension to more than two outcomes.

Pseudo-likelihood Estimation of Utility Functions
Assume the available data are (X i , A i , Y i , Z i ), i = 1, …, n, which comprise n independent and identically distributed copies of (X, A, Y, Z), where X ∈ X ⊆ ℝ p are patient covariates, A ∈ A = − 1, 1 is a binary treatment, and Y and Z are two real-valued outcomes for which higher values are more desirable. The extension to scenarios with more than two outcomes is discussed in the Appendix. An individualized treatment rule is a function d: X A such that, under d, a patient presenting with covariates X = x will be assigned to treatment d(x). Let Y*(a) denote the potential outcome under treatment a ∈ A, and for any regime d, define Y *(d) = ∑ a ∈ A Y *(a)1 d(X) = a . An optimal regime for the outcome Y, say d Y opt , satisfies EY * d Y opt ≥ EY *(d) for any other regime d. The optimal regime for the outcome Z, say d Z opt , is defined analogously. In order to identify these optimal regimes, and subsequently to identify the optimal regime across the class of utility functions introduced below, we make the following assumptions. In addition we assume that there is no interference between units nor are the multiple versions of treatment (Rubin, 1980). These assumptions are standard in causal inference (Robins, 2004;Hernan and Robins, 2010). Assumption 3 is not empirically verifiable in observational studies (Rosenbaum and Rubin, 1983;Rosenbaum, 1984). We assume that clinicians act with the goal of optimizing each patient's utility and that their success in identifying the optimal treatment depends on individual patient characteristics. Therefore, we assume that the clinicians are approximately, i.e., imperfectly, assigning treatment according to d U opt (x). If the clinician were always able to correctly identify the optimal treatment and assign A = d U opt (X) for each patient, there would be no need to estimate the optimal treatment policy (Wallace et al., 2016). Instead, we assume that the decisions of the clinician are imperfect and that Pr A = d U opt (x) X = x = expit x ⊤ β where β is an unknown parameter. We show in Section 2.2 that the model is identifiable under mild conditions; e.g., these exclude the possibility of a malevolent clinician that is systematically assigning poor treatments. We implicitly assume throughout that X may contain higher order terms, interactions, or basis functions constructed from patient covariates.

Fixed Utility
We begin by assuming that the utility function is constant across patients and takes the form u(y, z; ω) = ωy + (1 − ω)z for some ω ∈ [0, 1]. Lemma 1 of Butler et al. (2018) states that, for a broad class of utility functions, the optimal individualized treatment rule is equivalent to the optimal rule for a utility function of this form. Define Q ω (x, a) = ωQ Y (x, a) + (1 − ω)Q Z (x, a) and define d ω opt (x) = arg max a ∈ A Q ω (x, a). Let Q Y , n and Q Z, n denote estimators of Q Y and Q Z obtained from regression models fit to the observed data (Qian and Murphy, 2011). For a fixed value of ω, let Q ω, n (x, a) = ωQ Y , n (x, a) + (1 − ω)Q Z, n (x, a) and subsequently let d ω, n (x) = arg max a ∈ A Q ω, n (x, a) be the plug-in estimator of d ω opt (x). Given Q Y , n and Q Z, n , d ω, n (x) can be computed for each ω ∈ [0, 1].
The joint distribution of (X, A, Y, Z) is which depends on the unknown function d ω opt . Plugging in d ω, n for d ω opt into (7) yields the pseudo-likelihood If we let ω n and β n denote the maximum pseudo-likelihood estimators obtained by maximizing (2), then an estimator of the utility function is u n (y, z) = u y, z; ω n = ω n y + 1 − ω n z and expit x ⊤ β n is an estimator of the probability that a patient presenting with covariates x would be treated optimally under standard care. An estimator of the optimal policy at x is d ω n , n (x) = arg max a ∈ A Q ω n , n (x, a).
Because the pseudo-likelihood given in (2) is non-smooth in ω, standard gradient-based optimization algorithms cannot be used. However, for a given ω, it is straightforward to compute the profile estimator β n (ω) = arg max β ∈ ℝ p ℒ n (ω, β). We can compute the profile pseudo-likelihood estimator over a grid of values for ω and select the point on the grid yielding the largest pseudo-likelihood. The algorithm to construct (ω n , β n ) is given in Algorithm 1 below.
Step (3) can be accomplished using logistic regression. The theoretical properties of this estimator are discussed in Section 3.

Patient-specific Utility
Outcome preferences can vary widely across patients in some application domains, including schizophrenia (Kinter, 2009;Strauss et al., 2010) and pain management (Gan et al., 2004). To accommodate this setting, we assume that the utility function takes the form u(y, z; x, ω) = ω(x)y + {1 − ω(x)}z where ω: X [0, 1] is a smooth function. For illustration, we let ω(x; θ) = expit(x ⊤ θ) where θ is an unknown parameter. Misspecified utility models are discussed in the Appendix. Define Q θ (x, a) = ω(x; θ)Q Y (x, a) + {1 − ω(x; θ)}Q Z (x, a) and define d θ opt (x) = arg max a ∈ A Q θ (x, a). Let Q Y , n and Q Z, n denote estimators of Q Y and Q Z obtained from regression models fit to the observed data. For a fixed value of θ, let Q θ, n (x, a) = ω(x; θ)Q Y , n (x, a) + 1 − ω(x; θ) Q Z, n (x, a) and subsequently let d θ, n (x) = arg max a ∈ A Q θ, n (x, a) be the plug-in estimator of d θ opt (x). Assume that decisions are made according to the model Pr A = d θ opt (x) X = x = expit x ⊤ β . We compute the estimators (θ n , β n ) of (θ, β) by maximizing the pseudo-likelihood ℒ n (θ, β) ∝ ∏ i = 1 n exp X i ⊤ β1 A i = d θ, n X i 1 + exp X i ⊤ β . (3) An estimator for the utility function is u n (y, z; x) = ω x; θ n y + 1 − ω x; θ n z and an estimator for the optimal decision function is d θ n , n . The model, as stated is not identifiable.
However, we show below that it is identifiable under the following conditions.

Assumption 4
The following conditions hold.
proposal distribution, σ 2 , so that the acceptance proportion is between 0.25 and 0.5 (Geyer and Johnson, 2017).

Theoretical Results
Here we state a number of theoretical results pertaining to the proposed pseudo-likelihood estimation method for utility functions. We state results for a patient-specific utility function; the setting where the utility function is fixed is a special case. All proofs are deferred to the Appendix.
We assume that Pr A = d U opt (x) X = x = expit x ⊤ β 0 and that the true utility function is u(y, z; x, θ 0 ) = ω (X; θ 0 )y + {1 − ω (X; θ 0 )}z, where ω(X; θ) has bounded continuous derivative on compact sets and d θ 0 opt (X) = d θ opt (X) almost surely implies θ = θ 0 , i.e., the model introduced in Section 2.2 is well-defined and correctly specified with true parameters β 0 ∈ ℝ p and θ 0 ∈ ℝ d . We further assume that the estimators Q Y , n (x, a) and Q Z, n (x, a) are pointwise consistent for all ordered pairs (x, a). Along with Assumptions 1-3, we implicitly assume that the densities f(Y, Z|X, A) and f(X) exist. The following result states the consistency of the maximum pseudo-likelihood estimators for the utility function and the probability of optimal treatment. The proof involves verifying the conditions of Theorem 2.12 of Kosorok (2008).
Let V θ (d) = E u(Y , Z; X, θ) A = d(X) be the mean composite outcome in a population where decisions are made according to d. The following result establishes the consistency of the value of the estimated optimal policy. The proof uses general theory developed by Qian and Murphy (2011).
Theorem 8 (Value consistency with patient-specific utility) Let θ n be the maximum pseudo-likelihood estimator for θ and let d θ n , n be the associated estimated optimal policy.

Assumption 10
The following conditions hold.

2.
The conditional distribution of X given that D θ 0 (X) ≤ ϵ converges to a nondegenerate distribution as ϵ ↓ 0;

3.
There exist δ 1 , δ 2 > 0 such that where S d is the d-dimensional unit sphere.
Assumption 11 Define, for Z Y ∈ ℝ q 1 , Z Z ∈ ℝ q 2 , and U ∈ ℝ d , Remark 12 Assumption 9 establishes a rate of convergence for the estimated Q-functions and is automatically satisfied if the Q-functions are estimated using linear or generalized linear models with or without interactions or higher order terms. Assumption 10 is needed to ensure that there is positive probability of patients with x values near the boundary between where each treatment is optimal. Assumption 11 is standard in M-estimation.
Let (θ n , β n ) be the maximum pseudo-likelihood estimators given in Section 2.2. The following theorem states the asymptotic distribution of (θ n , β n ).

Theorem 13 (Asymptotic distribution) Under the given regularity conditions
Let Z * P denote convergence in probability over Z*, as defined in Section 2.

and Chapter
Theorem 14 (Parametric bootstrap) Assume Σ n = Σ 0 + o P (1) and ℎ n = v n n −1/5 , where v n P v 0 ∈ (0, ∞) and v 0 is the standard error of D θ 0 (X). Assume the regularity conditions given above hold. Let Z* N 0, I r × r , where I r×r is an r × r identity matrix and r = q + p. Let Σ 1 is the top left q × q block of Σ n (corresponding to Z Y and Z Z ), Σ 2 is the lower right p × p block, Σ 21 . is the upper right q × p block, Σ 12 = Σ 21 ⊤ , and the matrix square roots are the symmetric square roots obtained from the associated Eigenvalue decompositions. Let where ϕ 0 is the standard normal density. Define U n = arg min u ∈ ℝ d β n ⊤ k n Z Y , Z Z , u and B n = I n β n −1 Z A − k n Z Y , Z Z , U n . Then, where (U ⊤ , B ⊤ ) ⊤ is as defined in Theorem 13.
If we fix a large number of bootstrap replications, B, then U n, b , B n, b , b = 1, …, B will provide an approximation to the sampling distribution of the maximum pseudo-likelihood estimators. In Sections 4 and 5, we demonstrate the use of the bootstrap to test for heterogeneity of patient preferences.

Remark 15
In Theorem 13, it can be seen that when β 0 only involves an intercept, there is no relationship between β 0 and U, as the argmax of an objective function does not change under multiplication by a positive scalar. This relationship is more complex when β 0 includes covariate effects. Theorem 13 also indicates that the asymptotic behaviors of θ and β are driven largely by what happens at the boundary where D θ 0 (X) = 0.

Fixed Utility Simulations
To examine the finite sample performance of the proposed methods, we begin with the following simple generative model. Let X = (X 1 , …, X 5 ) ⊤ be a vector of independent normal random variables with mean 0 and standard deviation 0.5. Let treatment be assigned according to Pr A = d ω opt (x) X = x = ρ, i.e., the probability that the clinician correctly identifies the optimal treatment is constant across patients. Let ϵ Y and ϵ Z be independent normal random variables with mean 0 and standard deviation 0.5 and let Y = A (4X 1 − 2X 2 + 2) + ϵ Y and Z = A (2X 1 − 4X 2 − 2) + ϵ Z . We estimated Q Y and Q Z using linear models, implemented the proposed method for a variety of n, ω, and ρ values, and examined ω n , ρ n , and d ω n , n , across 500 Monte Carlo replications per scenario. Table 1 contains mean estimates of ω and ρ across replications along with the associated standard deviation across replications, and estimated error rate defined as the proportion of subjects to whom the estimated optimal policy does not recommend the true optimal treatment; to better characterize sampling variability in the estimated error rate the last column displays the median along with the first and third quartiles of the sampling distribution of the estimated error rate.
The pseudo-likelihood method performs well at estimating both ω and ρ, with estimation improving with larger sample sizes as expected. Table 2 contains estimated values of the true optimal policy, a policy where the utility function is estimated (the proposed method), policies estimated to maximize the two outcomes individually (corresponding to fixing ω = 1 and ω = 0), and the standard of care. The value of the standard of care is the mean composite outcome under the generative model. For each policy, the value is estimated by generating a testing sample of size 500 with treatment assigned according to the policy and averaging utilities (calculated using the true ω) in the testing set. The standard deviation across replications is included in parentheses.
The column labeled "estimated ω" refers to the proposed method. We see that the proposed method produces values which increase with n and generally come close to the true optimal policy. In all settings, the proposed method offers significant improvement over the standard of care. The proposed method also offers improvement over policies to maximize each individual outcome.
To further examine the performance of the proposed method, we allow the probability of optimal treatment to depend on patient covariates. Let Pr A = d ω opt (X) = expit 0.5 + X 1 . This corresponds to the case where β = (0.5, 1, 0, …, 0) ⊤ , where the first element of β is an intercept. Let X, Y, and Z be generated as described above. In this generative model, the probability that a patient is treated optimally in standard care is larger for patients with positive values of X 1 and smaller for patients with negative values of X 1 . We applied the proposed method to 500 replications of this generative model for various n and ω. Table 3 contains mean estimates of ω, root mean squared error (RMSE) of β n , and the error rate along with its standard error and quartiles. Estimation of the observational policy (as defined by β) improves with larger sample sizes.
The probability that the estimated policy assigns the optimal treatment also increases with the sample size. The true value of ω does not affect estimation of ω or β. Table 4 contains estimated values of the true optimal policy, a policy where the utility function is estimated (the proposed method), policies estimated to maximize each outcome individually, and the standard of care. Values are estimated from independent testing sets of size 500 as described above. The value under the standard of care is the mean composite outcome under the generative model.
The proposed method (found in the column labeled "estimated ω") produces values that are close to the true optimal policy in large samples and a significant improvement over standard of care in small to moderate samples. We note that value under the standard of care differs across ω. When ω is close to 1, the composite outcome places more weight on Y, for which the magnitude of the association with X 1 is larger. Because patients with larger values of X 1 are more likely to be treated optimally in this generative model, the standard of care produces larger composite outcomes when ω is closer to 1. Likewise, the mean composite outcome under policies to maximize each individual outcome varies with the true value of ω.

Patient-specific Utility Simulations
Next, we examine the case where the utility function is allowed to vary across patients. Let X, Y, and Z be generated as above. Again, assume that Pr A = d θ opt (X) = expit 0.5 + X 1 , i.e., where ω(X; θ) = expit(1 − 0.5X 1 ), i.e., θ = (1, −0.5, 0, …, 0) ⊤ , where the first element of θ is an intercept. We implemented the proposed method for various n and examined estimation of θ and β across 500 replications. Each replication is based on a simulated Markov chain of length 10,000 as described in Section 2.2. Results are summarized in Table 5.
Larger sample sizes produce marginal decreases in the RMSE of θ n . The estimated policy assigns the true optimal treatment more than 80% of the time for all sample sizes and the error rate decreases as the sample size increases. Table 6 contains estimated values of the true optimal policy, the policy estimated using the proposed method, policies estimated to maximize each outcome individually, and standard of care.
The proposed method produces policies that achieve significant improvement over the standard of care across sample sizes.
Finally, we examine the performance of the parametric bootstrap as described in Section 3. Let X be a bivariate vector of normal random variables with mean 0, standard deviation 0.5, and correlation zero. Let Y and Z be generated as above and let β = (2.5, 1, 0) ⊤ where the first element of β is an intercept. Let θ (1) be the vector θ with the first element removed. We are interested in testing the null hypothesis H 1 : ∥θ (1) ∥ = 0, which corresponds to a test for heterogeneity of patient preferences. The table below contains estimated power across 500 Monte Carlo replications under the null hypothesis, where the true value is θ = (1, 0, 0) ⊤ , and two alternative hypotheses: H 1 : θ = (1, 4, 3) ⊤ , and H 2 : θ = (1, 6, 6) ⊤ . All tests were conducted at level α = 0.05 and based on 1000 bootstrap samples. The last column in Table 7 shows the average agreement between the bootstrap and estimated optimal decision rule when θ 0 = (1, 0, 0); the results suggest the decision rule is stable at sample sizes and generative models considered.
The proposed bootstrap procedure produces type I error rates near nominal levels under the null and moderate power in large samples under alternative hypotheses.

Case Study: The STEP-BD Standard Care Pathway
The Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) was a landmark study of the effects of antidepressants in patients with bipolar disorder (Sachs et al., 2007). In addition to a randomized trial assessing outcomes for patients given an antidepressant or placebo, the STEP-BD study also included a large-scale observational study, the standard care pathway. As our method requires observational data on clinical decision making, we apply the proposed method to the observational data from the STEP-BD standard care pathway to estimate decision rules for the use of antidepressants in patients with bipolar disorder. (Clearly, as clinicians are not generally assigning treatment according to their best clinical judgment in a randomized clinical trial, the proposed method is not applicable to the randomized pathway of STEP-BD.) Although bipolar disorder is characterized by alternating episodes of depression and mania, recurrent depression is the leading cause of impairment among patients with bipolar disorder (Judd et al., 2002). However, the use of antidepressants has not become standard care in bipolar disorder due to the risk of antidepressants inducing manic episodes in certain patients (Ghaemi, 2008;Goldberg, 2008). Thus, the clinical decision in the treatment of bipolar disorder is whether to prescribe antidepressants to a specific patient in order to balance trade-offs between symptoms of depression, symptoms of mania, and other side effects of treatment.
We use the SUM-D score for depression symptoms and the SUM-M score for mania symptoms as outcomes. We consider a patient treated if they took any one of ten antidepressants that appear in the STEP-BD standard care pathway (Deseryl, Serzone, Citalopram, Escitalopram Oxalate, Prozac, Fluvoxamine, Paroxetine, Zoloft, Venlafaxine, or Bupropion). To generate candidate predictors for our model we made use of a complimentary randomized pathway in the STEP-BD trial. In this pathway, the patients are drawn from the same population, and the same variables are measured; however, treatment is randomly assigned so that there is no unmeasured confounding. Using step-wise variable selection to construct an outcome model from these data identified the following variables: mood elevation, anxiety, irritability, baseline SUM-M, and baseline SUM-D. We also used a step-wise logistic regression for the propensity score in the observational pathway to identify any additional potential confounders (Moodie et al., 2012). In addition to the variables in the outcome model, the logistic regression model identified race, insurance status, age, and substance abuse. The union of variables identified in through the randomized pathway and the propensity score were used in our models of the Q-functions and as tailoring variables in our treatment rules. Figure 1 contains box plots of SUM-D scores on the log scale by substance abuse and treatment. Figure 2 contains box plots of SUM-M scores on the log scale by substance abuse and treatment. For both outcomes, lower scores are more desirable. Figure 1 indicates that those without a history of substance abuse benefit from treatment with antidepressants. However, among those with a history of substance abuse, patients treated with antidepressants appear to have worse symptoms of depression. Figure 2 indicates that treatment has no effect on symptoms of mania among those without a history of substance abuse. However, among those with a history of substance abuse, it appears that treatment may be inducing manic episodes. Thus, a sensible treatment policy would be one that tends to prescribe antidepressants only to patients without a history of substance abuse.
We analyzed these data using the proposed method for optimizing composite outcomes. Results are summarized in Table 8 below. We estimated policies where both utility and probability of optimal treatment are fixed (fixed-fixed), where utility is fixed but probability of optimal treatment is assumed to vary between patients (fixed-variable), and where both utility and probability of optimal treatment are assumed to vary between patients (variable-variable). For both the fixed-variable policy and the variable-variable policy, we report E n expit X ⊤ β n in place of ρ n and for the variable-variable policy, we report E n expit X ⊤ θ n in place of ω n . Thus, for parameters that are assumed to vary across patients, Table 8 contains the mean estimate in the sample. To evaluate each estimated policy, we used five-fold cross-validation of the inverse probability weighted estimator (IPWE) of the value for each outcome; i.e., for each fold, we used the training portion to estimate the optimal policy and propensity score, and we used the testing portion to compute the IPWE of the value; taking the average of the IPWE value estimates across folds yields the reported values. For both SUM-D and SUM-M, lower scores are preferred. Value is reported as the percent improvement over standard of care, calculated using the estimated utility function. Large percent improvements in value are preferred.
All estimated policies produce more desirable SUM-D scores and SUM-M scores compared to standard of care and improved value according to the estimated utility. Allowing the probability of optimal treatment to vary between patients leads to further improvements in value, as does allowing the utility function to vary between patients. All policies produce similar estimates for the probability of optimal treatment averaged across patients.
The resulting decision rules can be written as the sign of a linear combination of the covariates. As an example, the fixed-fixed policy assigns treatment with antidepressants when 0.032 − 0.001(age) − 0.646(substance abuse) − 0.007(mood elevation) + 0.007(medical insurance) + 0.129(white) is non-negative. The negative coefficient for substance abuse means that a history of substance abuse indicates that a patient should not be prescribed antidepressants. Prior research has shown that patients with a history of substance abuse are more likely to abuse antidepressants (Evans and Sullivan, 2014). This may contribute to the poor outcomes experienced by STEP-BD patients with a history of substance abuse who were treated with antidepressants. Table 9 displays estimates and standard errors of the components of θ n in the variable-variable policy. A test for preference heterogeneity based on 1000 bootstrap samples generated according to Theorem 14 yielded a p-value < 0.001.
As a secondary analysis, we use the SUM-D score and a side effect score as the outcomes. Eight side effects were recorded in the STEP-BD standard care pathway (tremors, dry mouth, sedation, constipation, diarrhea, headache, poor memory, sexual dysfunction, and increased appetite). Patients rated the severity of each side effect from 0 to 4 with larger values indicating more severe side effects. We took the mean score across side effects as the second outcome. Results are summarized in Table 10, reported analogously to those in Table  8.
Each estimated policy produces improved SUM-D scores and improved side effect scores compared to the standard of care. Each policy also produces improvement in value according to the estimated utility function. Again, allowing the utility function to vary between patients results in further improvements in value. Each policy produces similar estimates of the probability that patients are treated optimally in standard care. The variable-variable policy places more weight on SUM-D scores on average compared to the other policies. Table  11 displays estimates and standard errors of coefficients in θ n in the variable-variable policy. The bootstrap procedure for testing the null hypothesis that patient preferences are homogeneous based on 1000 bootstrap samples yielded a p-value < 0.001.

Discussion
The estimation of individualized treatment rules has been well-studied in the statistical literature. Existing methods have typically defined the optimal treatment rule as optimizing the mean of a fixed scalar outcome. However, clinical practice often requires consideration of multiple outcomes. Thus, there is a disconnect between existing statistical methods and current clinical practice. It is reasonable to assume that clinicians make treatment decisions for each patient with the goal of maximizing that patient's utility. Therefore, it is natural to use observational data to estimate patient utilities from observed clinician decisions. This represents a new paradigm for the use of observational data in the context of precision medicine in that clinical decisions are viewed as a (noisy) surrogate for patient preferences and leveraged to improve the quality of a learned treatment rule and to generate new insights into heterogeneity in patient preferences.
The proposed methodology offers many opportunities for future research. In the present manuscript, we have considered only the simplest case-that of one decision time, two outcomes, and two possible treatments. Scenarios with more than two outcomes are discussed in the Appendix, and the simulation results there demonstrate that the proposed method performs well with three outcomes. Extensions to more than two treatments or multiple time points represent potential areas for future research. The proposed method requires positing a parametric model for the utility function. Model misspecification is discussed in the Appendix, and the simulation results there demonstrate that the proposed method performs reasonably well when important covariates are omitted from the model for the utility function. However, the use of semi-or non-parametric models is an important extension. A more technical direction for future work is a more nuanced study of the affect of boundary conditions on the resulting rate of convergence (see Assumption 10).
Finally, while we have proposed our utility function estimator inside the framework of onestage Q-learning, the pseudo-likelihood utility function estimator could be used alongside other existing one-stage optimal treatment policy estimators based on (augmented) inverse probability weighting (e.g., Zhao et al., 2012;Zhang et al., 2012a). There is a great future for the development of methods for optimizing composite outcomes in precision medicine and application of these methods in clinical studies.
Let u(X, A; θ) = ω(X; θ) Q Y , n (X, A) − Q Z, n (X, A) + Q Z, n (X, A), which lies in a VC class indexed by θ ∈ ℝ p by Lemma 9.6 and Lemma 9.9 (viii), (vi), and (v) of Kosorok (2008). We have that and it follows that 1 A = d θ, n (X) is contained in a GC class indexed by θ ∈ ℝ p . From Corollary 9.27 (ii) of Kosorok (2008) it follows that X ⊤ β1 A = d θ, n (X) lies in a GC class indexed by (θ, β) ∈ ℝ p × ℬ as long as X ⊤ β is uniformly bounded by a function with finite mean, which holds as long as ℬ is compact and EX < ∞. It follows that sup (θ, β) ∈ ℝ p × ℬ E n − E X ⊤ β1 A = d θ, n (X) − log 1 + exp X ⊤ β P 0.
Thus, we also have that Define U 0 Z Y , Z Z = arg max u ∈ ℝ d β 0 ⊤ k 0 Z Y , Z Z , u . Previous arguments yield that By Assumption 11, the arg min for each Z Y ⊤ , Z Z ⊤ ⊤ ∈ K 1 is unique. Fix ϵ > 0. By (11), there exists an m 2 < ∞ such that Pr sup Z Y ⊤ , Z Z ⊤ ⊤ ∈ K 1 U n Z Y , Z Z < m 2 ≥ 1 − ϵ for all n large enough. By (13), we can enlarge m 2 such that We can also find an m 1 < ∞ such that K 1 ⊂ K m 1 q as defined in Corollary 17.
It is straightforward to show that (1) and (3) in Corollary 17 are satisfied by ⊤ ⊤ . Let f n (Z, u) = β n ⊤ k n Z Y , Z Z , u . Standard arguments and the given assumptions yield that there exists a w 1 < ∞ such that sup Z ∈ K m 1 q sup u ∈ K m 2 d f n (Z, u) < w 1 almost surely and sup Z 1 , Z 2 ∈ K m1 q : Z 1 − Z 2 < δ f n Z 1 , u − f n Z 2 , u K m 2 d < w 1 δ for all δ > 0 and all n ≥ 1 almost surely. Every subsequence in (12) has a further subsequence n″ on which the convergence in probability to zero can be replaced with almost sure convergence. Thus, (2) and (4) of Corollary 17 apply, using the fact that minimizing is equivalent to maximizing after a change in sign. Setting sup Z Y ⊤ , Z Z ⊤ ⊤ ∈ K m 1 q U n″ * Z Y , Z Z − U 0 Z Y , Z Z 0 almost surely. Since this is true for every subsequence, we have that Since ϵ was arbitrary, we obtain that sup Z Y ⊤ , Z Z ⊤ ⊤ ∈ K m 1 q U n Z Y , Z Z − U 0 Z Y , Z Z = o P (1) .
Let B L (B) be the space of all Lipschitz continuous functions mapping B ℝ which are bounded by 1 and which have Lipschitz constant 1. Let E Z be expectation with respect to Z 0 * = Z Y *, Z Z * , Z A * ⊤ ⊤ N 0, Σ 0 . Let B 0 Z 0 * = I 0 −1 Z A * − k 0 Z Y *, Z Z * , U n Z 0 * and let f ∈ B L ℝ d + p . Then, using U n and B n as defined in the statement of the theorem, where we define both A n = sup f ∈ B L ℝ d E Z f U n Z 0 * − E Z f U 0 Z 0 * and which implies that A n = o P (1) since was arbitrary. For K 2 ⊂ ℝ q + p such that Pr Z 0 * ∈ K 2 ≥ 1 − ϵ, previous arguments yield that As before, we can argue that B n = o P (1) + 2ϵ, which implies that B n = o P (1) since ϵ was arbitrary. The result follows.■

Theorem 16
Let H be a compact set in a metric space with metric d and let ℱ be a compact subset of C[H] with respect to the uniform norm, ∥ · ∥ H . For each f ∈ ℱ, let u(f) = arg max ℎ ∈ H f(ℎ), where, when the arg max is not unique, we select one element of the arg max set either randomly or algorithmically. Suppose also that there exists a closed ℱ 1 ⊂ ℱ for which each f ∈ ℱ 1 has a unique maximum. Then, Proof [Proof of Theorem 16] Fix ϵ > 0. For each f ∈ ℱ 1 , there exists δ f > 0 such that is the open d-ball of radius ϵ around u. This follows since the compactness of ℱ ensures that all f ∈ ℱ are continuous. Let g ∈ ℱ be such that ∥f -g∥ H < δ f . Then, f {u(g)} > g {u(g)} − δ f ≥ g {u(f)} − δ f > f {u(f)} − 2δ f , which implies that d{u(g), u(f)} < ϵ. We have that ∪ f ∈ ℱ 1 g ∈ ℱ: g − f H < δ f is an open cover of ℱ 1 . Since ℱ 1 is compact, there exists a set ℱ 1 ϵ such that ℱ 1 ϵ is finite and ∪ f ∈ ℱ 1 ϵ g ∈ ℱ: g − f H < δ f still covers ℱ 1 . Let f n ∈ ℱ and g n ∈ ℱ be sequences such that ∥f n − g n ∥ → 0. By compactness, every subsequence has a further subsequence n″ such that f n′′ f 0 ∈ ℱ 1 and g n′′ g 0 ∈ ℱ so that both f 0 and g 0 are in a set of the form g ∈ ℱ: g − f H < δ f for some f ∈ ℱ 1 ϵ . This implies that d{u(g 0 ), u(f 0 )} < ϵ. Since the subsequence was arbitrary, we have that lim sup n→∞ d{u(g n ), u(f n )} ≤ ϵ. Since ϵ was arbitrary, we now have that lim sup n→∞ d{u(g n ), u(f n )} = 0 for any sequences f n ∈ ℱ 1 and g n ∈ ℱ such that ∥f n − g n ∥ → 0. This proves the result.■
Because sup z ∈ K m 1 q f n (z, ⋅ ) − f(z, ⋅ ) K m 2 d < δ for all n large enough, the result follows.■ Proof [Remark 6.] We require estimation be restricted to parameters (β, θ) which satisfy P β {A = d θ (X)|X} > 1/2 with probability one. Suppose towards a contradiction that such a set of parameters also satisfies where P β (x) = expit(x ⊤ β). For any (x, a) such that a = d θ (x) ≠ d θ 0 (x) and it follows that P β (x) = P −β 0 (x) < P β 0 (x) = P −β (x) which contradicts the restriction that the probability of recommending an optimal treatment exceeds 1/2. Thus, it follows that d θ (x) = d θ 0 (x) for almost all x. This in turn implies that P β(x) = P β 0 (x) for almost all x from which it follows that β = β 0 provided EXX ⊤ is full rank.■

Appendix B: Three or More Outcomes
Assume the available data consist of (X i , A i , Y 1,i , …, Y K,i ), i = 1, …, n, which comprise n independent and identically distributed copies of (X, A, Y 1 , …, Y K ), where X and A are as defined previously, and Y 1 , …, Y K are outcomes, with each outcome coded so that higher values are better. Assume there exists an unknown utility function U = u(Y 1 , …, Y K ), where u: ℝ K ℝ, such that u(y 1 , …, y K ) quantifies the "goodness" of the outcome vector (y 1 , …, y K ). As before, let U*(d) be the potential utility under a treatment regime d. Let d U opt be the optimal regime for the utility defined by u, i.e., EU* d U opt ≥ EU*(d) for any regime d. The goal is to estimate the utility function and the associated optimal regime in the presence of more than two outcomes.
To begin, we assume that the utility function is constant across patients and takes the form Define Q Y k (x, a) = E Y k X = x, A = a , for k = 1, …, K. Define also Q ω (x, a) = E u Y 1 , …, Y K ; ω X = x, A = a and note that Q ω (x, a) = ∑ k = 1 K − 1 ω k Q Y k (x, a) + 1 − ∑ k = 1 K − 1 ω k Q Y K (x, a). The Q-functions for each outcome can be estimated from the observed data using regression models. Let Q Y k , n (x, a) denote an estimator for Q Y k (x, a). Then, an estimator for Q ω (x, a) is Q ω, n (x, a) = ∑ k = 1 K − 1 ω k Q Y k , n (x, a) + 1 − ∑ k = 1 K − 1 ω k Q Y K , n (x, a). For any fixed ω, we can compute an estimator for d ω opt as d ω, n (x) = arg max a ∈ A Q ω, n (x, a). The pseudo-likelihood for a vector β and a vector ω = (ω 1 , …, ω K−1 ). For K = 2, this reduces to the formulation in Section 2. Estimators for β and ω 1 , …, ω K−1 can be obtained by maximizing the pseudolikelihood in (14). Letting ω n = ω 1, n , …, ω K − 1, n denote the maximum pseudo-likelihood estimator for ω, an estimator for the optimal regime is d ω n , n (x) = arg max a ∈ A Q ω n , n (x, a).
When the number of outcomes is large, maximizing (2) using the grid search proposed in Section 2.1 is infeasible. However, we can use the Metropolis algorithm similar to that proposed for a patient-specific utility function. A patient-specific utility function can be accommodated by setting u y 1 , …, y K ; x, θ = ∑ k = 1 K − 1 expit x ⊤ θ k y 1 + 1 − ∑ k = 1 K − 1 expit x ⊤ θ k y K for a vector θ = θ 1 ⊤ , …, θ K − 1 ⊤ ⊤ and maximizing the pseudo-likelihood using the Metropolis algorithm.
We set ω 1 = 0.2, ω 2 = 0.4, and ρ = 0.6, 0.8. Table 12 contains parameter estimates averaged across 500 replications along with standard deviations (in parentheses) across replications. The error rate is the proportion of samples in a testing set that were assigned the optimal treatment by the estimated policy. Table 13 contains estimated values (calculated by generating an independent testing set following the same generative model but with decisions made according to each policy) of the optimal policy, a policy where the utility function is estimated (the proposed method), policies estimated to maximize each outcome individually, and standard of care. The proposed method results in values close to the true optimal in large samples and larger than maximizing each individual outcome across sample sizes. Estimation results for simulations where utility and probability of optimal treatment are fixed, with three outcomes.  Value results for simulations where utility and probability of optimal treatment are fixed, with three outcomes. and the standard of care. The proposed method produces comparable results regardless of whether the utility function is misspecified or not.  Table 15 contains results for the same model misspecification, but where the true utility function is ω(x; θ) = expit 1 + 4x 1 2 + x ⊤ θ 0 with θ 0 = (−0.5, 0, 0, 1, 4) ⊤ , i.e., the coefficients for the components that are left out of the misspecified model are larger. When the coefficients of the components left out of the utility function model are larger, the proposed method produces better results when the model is correctly specified. However, even in the presence of model misspecification, the proposed method produces results that improve upon the standard of care.  In the next simulation, let ω(x; θ) = expit(1 + x T θ 0 ) with θ 0 = (−0.5, 0, 0, 1, 0.5) and the true probability of assigning the optimal treatment is Pr A = d ω opt (x) X = x = expit 0.5 + x 1 when the misspecified model is assumed as a constant. In Table 17, it is noticeable that the difference between the two estimated values is similar as in Table 16. For all settings above, the estimated value of the correct model is similar to the estimated value of the models with both the misspecified utility functions and the misspecified probability of optimal treatment. Appendix E: Relationship between the probability of assigning the optimal treatment and the variance of the estimator of the utility  In this section, we explore the relationship between the probability of assigning the optimal treatment and the variance of the estimator of the utility, which was mentioned in Remark 15. We empirically examine how the standard error of the estimator of the utility changes as Pr A = d ω opt (x) X = x varies. For simplicity, we focus on the fixed utility setting described in 4.1. Let X, Y and Z be generated as in 4.1. For each ρs, we estimate ω when ω = 0.25 and ω = 0.75. Table 18 and Table 19 display ω and s . e . (ω) across values of n, ω, and ρ. As ρ deviates from 0.5 (so that β is not close to 0), ω is closer to the true ω, and the standard error of ω decreases. Moreover, if ω 1 and ω 2 satisfy ω 1 = 1 − ω 2 , their estimates and standard errors are similar as logit(ω 1 ) = −logit(ω 2 ).
Appendix F: Performance of Metropolis optimizer in the algorithm with the patient-specific utility. In this section, we examine the performance of the Metropolis optimizer that is used in 2.2. We randomly select 10000 points in the unit ball around θ 0 , and use the point that yields the best likelihood as the reference for the comparison. We define θ by Metropolis algorithm as θ MH , and a best point from a unit ball around θ 0 as θ Uball . We obtain the distance between θ 0 and θ MH and the distance between θ 0 and θ Uball for each of 500 replications.
In Table 20, absolute difference-MH was calculated as the mean of θ MH − θ 0 , and absolute difference-Unit ball was calculated as the mean of θ Uball − θ 0 for 500 replications.
Similarly, Distance-MH was calculated as the mean of ∑ j = 1 6 θ MH, j − θ 0, j 2 , and Distance-Unit ball was calculated as the mean of ∑ i = 1 6 θ Uball , j − θ 0, j 2 for 500 replications. The standard errors of these distances are in the parentheses.
By Table 20, it is noticeable that both absolute difference and the Euclidean distance of θ MH are smaller, and we could conclude that it is reasonable to use Metropolis algorithm when estimating a patient-specific utility.  Estimation results for simulations where utility and probability of optimal treatment are fixed.  Estimation results for simulations where utility is fixed and probability of optimal treatment is variable.  Estimation results for simulations where both utility and probability of optimal treatment are variable.     Results of analysis of STEP-BD data for SUM-D and Side effect score.